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Abstract 

We propose a visual recognition system that is designed 
for fine-grained visual categorization. The system is com¬ 
posed of a machine and a human user. The user, who is un¬ 
able to carry out the recognition task by himself is interac¬ 
tively asked to provide two heterogeneous forms of informa¬ 
tion: clicking on object parts and answering binary ques¬ 
tions. The machine intelligently selects the most informative 
question to pose to the user in order to identify the object’s 
class as quickly as possible. By leveraging computer vision 
and analyzing the user responses, the overall amount of hu¬ 
man effort required, measured in seconds, is minimized. We 
demonstrate promising results on a challenging dataset of 
uncropped images, achieving a significant average reduc¬ 
tion in human effort over previous methods. 


1. Introduction 

Vision researchers have become increasingly interested 
in recognition of parts [2, 8, 21], attributes [6, 11, 12], and 
fine-grained categories ( e.g. specific species of birds, flow¬ 
ers, or insects) [1,3, 14, 15]. Beyond traditionally studied 
basic-level categories, these interests have led to progress 
in transfer learning and learning from fewer training ex¬ 
amples [7, 8, 10, 15, 21], larger scale computer vision al¬ 
gorithms that share processing between tasks [15, 16], and 
new methodologies for data collection and annotation [2,4]. 

Parts, attributes, and fine-grained categories push the 
limits of human expertise and are often inherently ambigu¬ 
ous concepts. For example, perception of the precise lo¬ 
cation of a particular part (such as a bird’s beak) can vary 
from person to person, as does perception of whether or 
not an object is shiny. Fine-grained categories are usually 
recognized only by experts (e.g. the average person cannot 
recognize a Myrtle Warbler), while one can recognize im¬ 
mediately basic categories like cows and bottles. 

We propose a key conceptual simplification: that humans 
and computers alike should be treated as valuable but ulti¬ 
mately imperfect sources of information. Humans are able 




Ground Truth Part Locations 


IMAGE CLASS: Sooty Albatross 


Q3: Is the bill black? 
yes (4.274 s) 


Q2: Click on the 
(3.033 s) 


Figure 1. Our system can query the user for input in the form of 
binary attribute questions or part clicks. In this illustrative exam¬ 
ple, the system provides an estimate for the pose and part locations 
of the object at each stage. Given a user-clicked location of a part, 
the probability distributions for locations of the other parts in each 
pose will adjust accordingly. The rightmost column depicts the 
maximum likelihood estimate for part locations. 


to detect and broadly categorize objects, even when they 
do not recognize them, as well as carry out simple mea¬ 
surements such as telling color and shape; human errors 
arise primarily because (1) people have limited experiences 
and memory, and (2) people have subjective and percep¬ 
tual differences. In contrast, computers can run identical 
pieces of software and aggregate large databases of infor¬ 
mation. They excel at memory-based problems like recog¬ 
nizing movie posters but struggle at detecting and recog¬ 
nizing objects that are non-textured, immersed in clutter, or 
highly shape-deformable. 

In order to achieve a unified treatment of humans and 
computers, we introduce models and algorithms that ac¬ 
count for errors and inaccuracies of vision algorithms (part 
localization, attribute detection, and object classification) 
and ambiguities in multiple forms of human feedback (per¬ 
ception of part locations, attribute values, and class labels). 
The strengths and weaknesses of humans and computers for 


















(a) Pose clusters (b) User interface for part locations input 


Figure 2. 2(a) Images in our dataset are clustered by /c-means on the spatial offsets of part locations from parent part locations. Semantic 
labels of clusters were manually assigned by visual inspection. Left/right orientation is in reference to the image. 2(b) The user clicks on 
his/her perceived location of the breast (x p , y p ), which is shown as a red X and is assumed to be near the ground truth location (x p , y p ). 
The user also can click a checkbox indicating part visibility v p . Features ip p (x, 6 P ) can be extracted from a box around 9 P . 


these modalities are combined using a single principled ob¬ 
jective function: minimizing the expected amount of time 
to complete a given classification task. 

Consider for example different types of human annota¬ 
tion tasks in the domain of bird species recognition. For the 
task “Click on the beak ,” the location a human user clicks 
is a noisy representation of the ground truth location of the 
beak. It may not in isolation solve any single recognition 
task; however, it provides information that is useful to a 
machine vision algorithm for localizing other parts of the 
bird, measuring attributes (e.g. cone-shaped), recognizing 
actions (e.g. eating or flying), and ultimately recognizing 
the bird species. The answer to the question “Is the belly 
striped?” similarly provides information towards recogniz¬ 
ing a variety of bird species. Each type of annotation takes 
a different amount of human time to complete and provides 
varying amounts of information. 

Our models and algorithms combine all such sources of 
information into a single principled framework. We have 
implemented a practical real-time system 1 for bird species 
identification on a dataset of over 200 categories. Recog¬ 
nition and pose registration can be achieved automatically 
using computer vision; the system can also incorporate hu¬ 
man feedback when computer vision is unsuccessful by in¬ 
telligently posing questions to human users (see Figure 1). 

The contributions of this paper are: (1) We introduce 
models and algorithms for object detection, part localiza¬ 
tion, and category recognition that scale efficiently to large 
numbers of categories. Our algorithms can localize and 
classify objects on a 200-class dataset in a fraction of a sec¬ 
ond, using part and attribute detectors that are shared among 
classes. (2) We introduce a formal model for evaluating the 
usefulness of different types of human input that takes into 
account varying levels of human error, time spent, and in¬ 
formativeness in a multiclass or multitask setting. We intro¬ 
duce fast algorithms that are able to predict the informative¬ 
ness of 312 binary questions and 13 part click questions in a 
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fraction of a second. All such computer vision algorithms, 
forms of user input, and question selection techniques are 
combined into an integrated framework. (3) We present a 
thorough experimental comparison of a number of methods 
for optimizing human input. 

The structure of the paper is as follows: In Section 2, we 
review related work. We define the problem and describe 
the algorithm in Sections 3 and 4. In Section 5 we discuss 
implementation details, in Section 6 we present empirical 
results, and finally in Section 7 we discuss future work. 

2. Related Work 

Our work extends prior work by Branson et al. [3], 
who introduced a system combining human interaction with 
computer vision that was applied to bird species classifi¬ 
cation. They used an information-theoretic framework to 
intelligently select attribute questions such as “Is the belly 
spotted?”, “Is the wing white?”, etc. to identify the true bird 
species as quickly as possible. Our work has 3 main dif¬ 
ferences: (1) While [3] used non-localized computer vision 
methods based on bag-of-words features extracted from the 
entire image, we use localized part and attribute detectors. 
Thus [3] relied on experiments with test images cropped by 
ground truth bounding boxes; in contrast, our experiments 
are performed on uncropped images in unconstrained en¬ 
vironments. (2) Whereas [3] incorporated only one type 
of user input - binary questions pertaining to attributes - 
we allow heterogeneous forms of user input including user- 
clicked part locations. Users can click on any pixel location 
in an image, introducing significant algorithmic and com¬ 
putational challenges as we must reason over hundreds of 
thousands of possible click point and part locations. (3) 
Whereas [3] measured human effort in terms of the total 
number of questions asked, we introduce an extended ques¬ 
tion selection criterion that factors in the expected amount 
of human time needed to answer each type of question. 

A number of papers have recently addressed fine-grained 
categorization [1, 3, 13, 14, 15, 22]. One important differ- 

























ence between this paper and prior work is that we use local¬ 
ized computer vision algorithms based on part and attribute 
detectors (in contrast, [3, 13, 14, 15] rely on bag-of-words 
methods). The differences between fine-grained categories 
are subtle, such that it is likely that lossy representations 
such as bag-of-words are insufficient. Bird field guides [19] 
suggest that humans use strongly localized methods based 
on parts and attributes to perform bird species recognition, 
justifying our approach. Also, we introduce an extended 
version of the CUB-200 dataset [1 ] that consists of 200 
bird species and nearly 12, 000 images, each of which is la¬ 
beled with part locations and attribute annotations (example 
images are shown in Figure 2(a)). Additional discussion of 
related work covered in [3] is not revisited here. 

Our integrated approach builds on two areas in com¬ 
puter vision: part-based models and attribute-based learn¬ 
ing, which have both been explored in depth in other 
works. Specifically, we use a part representation simi¬ 
lar to a Felzenszwalb-style deformable part model [8, 9] 
(sliding window HOG-based part detectors fused with tree- 
structured spatial dependencies). Whereas most attribute- 
based methods [6, 12] use non-localized classifiers, [5, 
20] incorporate object or part-level localization with at¬ 
tribute detectors. Our methods differ from earlier work 
on parts and attributes by (1) the specific combination of 
a Felzenszwalb-style deformable part model with localized 
attribute detectors, (2) the additional ability to combine part 
and attribute models with different types of user input, and 
(3) the deployment of such methods on a dataset of larger 
scale, localizing 200 object classes, 13 parts, 11 aspects, 
and 312 binary attributes in a fraction of a second. 

3. Framework for Visual Recognition 

In this section, we introduce a principled framework for 
integrating part-based detectors, multi-class categorization 
algorithms, and different types of human feedback into a 
common probabilistic model. We also introduce efficient 
algorithms for inferring and updating object class and lo¬ 
calization predictions as additional user input is obtained. 
We begin by formally defining the problem. 


a c = [aI ...a^], a\ G 0,1. We extend this model to include 
part localized attributes, such that each attribute a G 1...A 
can optionally be associated with a part part (a) G 1...P 
( e.g . the attributes white belly and striped belly are both as¬ 
sociated with the part belly). In this case, we express the set 
of all ground truth part locations for a particular object as 
© = { 0i...0p }, where the location 6 p of a particular part p 
is represented as an x pi y p image location, a scale s p , and an 
aspect v p (e.g. side view left, side view right, frontal view, 
not visible, etc.): 

0 p = \x P i y P f s p , Vp\. ( 1 ) 

Note that the special aspect not visible is used to handle 
parts that are occluded or self-occluded. 

We can optionally combine our computer vision algo¬ 
rithms with human input, by intelligently querying user in¬ 
put at runtime. A human is capable of providing two types 
of user input which indirectly provide information relevant 
for predicting the object’s class: mouse click locations 0 P 
and attribute question answers di. The random variable 
Op represents a user’s input of the part location 0 P , which 
may differ from user to user due to both clicking inaccura¬ 
cies and subjective differences in human perception (Figure 
2(b)). Similarly, di is a random variable defining a user’s 
perception of the attribute value a ? > 

We assume a pool of A + P possible questions that can 
be posed to a human user Q = {q\...qA, 
where the first A questions query d{ and the remaining P 
questions query 0 P . Let Aj be the set of possible answers to 
question qj. At each time step £, our algorithm considers the 
visual content of the image and the current history of ques¬ 
tion responses to estimate a distribution over the location of 
each part, predict the probability of each class, and intelli¬ 
gently select the next question to ask qj(t). A user provides 
the response Uj^ to a question qj(t), which is the value of 
0 P or di for part location or attribute questions, respectively. 
The set of all user responses up to timestep t is denoted by 
the symbol U l = {uj(iy..Uj^}. We assume that the user 
is consistent in answering questions and therefore the same 
question is never asked twice. 


3.1. Problem Definition and Notation 

Given an image x , our goal is to predict an object class 
from a set of C possible classes (e.g. Myrtle Warbler, Blue 
Jay, Indigo Bunting) within a common basic-level category 
(e.g. Birds). We assume that the C classes fall within a 
reasonably homogeneous basic-level category such as birds 
that can be represented using a common vocabulary of P 
parts (e.g. head, belly, wing), and A attributes (e.g. cone- 
shaped beak, white belly, striped breast). We use a class- 
attribute model based on the direct-attribute model of Lam- 
pert et al. [ 1 ], where each class cG 1... C is represented us¬ 
ing a unique, deterministic vector of attribute memberships 


3.1.1 Probabilistic Model 


Our probabilistic model incorporating both computer vision 
and human user responses is summarized in Figure 3(b). 
Our goal is to estimate the probability of each class given 
an arbitrary collection of user responses U l and observed 
image pixels x: 


p{c\U\x) 


p(a c , iy\x) 

T, c p( aC , ut \xY 


( 2 ) 


which follows from the assumption of unique, class- 
deterministic attribute memberships a c [12]. We can in- 



3.2.2 Part Detection 




(a) Part Model (b) Part-Attribute Model 

Figure 3. Probabilistic Model. 3(a): The spatial relationship be¬ 
tween parts has a hierarchical independence structure. 3(b): Our 
model employs attribute estimators, where part variables 0 P are 
connected using the hierarchical model shown in 3(a). 

corporate localization information © into the model by in¬ 
tegrating over all possible assignments to part locations 

p(a c ,U t \x) = f p(a c ,U t ,Q\x)dQ. (3) 

We can write out each component of Eq 3 as 

p(a c , P, ©|x) = p(a c |0, x)p(Q\x)p(U t \a c , 0, x) (4) 

where p( a c |0,x) is the response of a set of attribute de¬ 
tectors evaluated at locations 0 , p(0 \x) is the response of 
a part-based detector, and piU 1 |a c , 0, x) models the way 
users answer questions. We describe each of these proba¬ 
bility distributions in Sections 3.2.1, 3.2.2, and 3.3 respec¬ 
tively and describe inference procedures for evaluating Eq 3 
efficiently in Section 3.4. 

3.2. Computer Vision Model 

As described in Eq 4, we require two basic types of com¬ 
puter vision algorithms: one that estimates attribute proba¬ 
bilities p(a c |0, x) on a particular set of predicted part loca¬ 
tions 0, and another that estimates part location probabili¬ 
ties p(Q\x). 

3.2.1 Attribute Detection 

Using the independence assumptions depicted in Figure 
3(b), we can write the probability 

p(a C \e,x)= JJ PKI^partK), A (5) 

a?£a= 

Given a training set with labeled part locations 0 part ( a .), one 
can use standard computer vision techniques to learn an es¬ 
timator for each p(ui|6 > part ( a .), x). In practice, we train a 
separate binary classifier for each attribute, extracting lo¬ 
calized features from the ground truth location 0 par t( ai )- 
As in [12], we convert attribute classification scores zi = 
f a {x; part(a^)) to probabilities by fitting a sigmoid func¬ 
tion a(j a Zi) and learning the sigmoid parameter 7 a using 
cross-validation. When u part ( a .) = not visible, we assume 
the attribute detection score is zero. 


We use a pictorial structure to model part relationships (see 
Figure 3(a)), where parts are arranged in a tree-structured 
graph T = (V,E). Our part model is a variant of the model 
used by Felzenszwalb et al. [ 8 ], which models the detection 
score g(x;0) as a sum over unary and pairwise potentials 
log(p(0|x)) oc g(x; 0) with 

p 

g(x; 0) * ^ ip(x; 6 P ) + ^ A (9 p ,d q ) (6) 

P=1 ( p,q)eE 

where each unary potential ^{x\0 p ) is the response of a 
sliding window detector, and each pairwise score A(0 p , Q q ) 
encodes a likelihood over the relative displacement between 
adjacent parts. We use the same learning algorithms and 
parametrization of each term in Eq 6 as in [22]. Here, 
parts and aspects are semantically defined, multiple aspects 
are handled using mixture models, and weight parameters 
for appearance and spatial terms are learned jointly using a 
structured SVM [1 ]. After training, we convert detection 
scores to probabilities p(Q\x) oc exp (7 g{pc\ ©)), where 7 
is a scaling parameter that is learned using cross-validation. 

3.3. User Model 

Readers interested in a computer-vision-only system 
with no human-in-the-loop can skip to Section 3.4. We as¬ 
sume that the probability of a set of user responses U l can 
be expressed in terms of user responses that pertain to part 
click locations U@ C U l and user responses that pertain to 
attribute questions C U l . We assume a user’s percep¬ 
tion of the location of a part 0 P depends only on the ground 
truth location of that part 0 P , and a user’s perception of an 
attribute ai depends only on the ground truth attribute a\\ 

PiU* |a c ,0,x)= I P[ p(0 p |0 P ) J | F[ p(a»K) J • 
\peL% J \ateui J 

We describe our methods for estimating p(0 p \0 p ) and 
p{ai\af) in Sections 3.3.1 and 3.3.2 respectively. 

3.3.1 Modeling User Click Responses 

Our interface for collecting part locations is shown in Figure 
2(b). We represent a user click response as a triplet 0 P = 
{x p ,y p ,v p }, where (x p , y p ) is a point that the user clicks 
with the mouse and v p E {visible , not visible } is a binary 
variable indicating presence/absence of the part. 

Note that the user click response 6 P models only part lo¬ 
cation and visibility, whereas the true part location 0 P also 
includes scale and aspect. This is done in order to keep the 
user interface as intuitive as possible. On the other hand, in¬ 
corporating scale and aspect in the true model is extremely 









important - the relative offsets and visibility of parts in left 
side view and right side view will be dramatically different. 
We model a distribution over user click responses as 

p{0p\0p) — p{x p , y p \x p , y p , Sp)p{vp\vp) (8) 

where the relative part click locations are Gaussian dis- 
tributed ( - s jp L , ~~ yp ) ~ and each p(v p \v p ) 

is a separate binomial distribution for each possible value of 
v p . The parameters of these distributions are estimated us¬ 
ing a training set of pairs (0 p ,Q p ). This model of user click 
responses results in a simple, intuitive user interface and 
still allows for a sophisticated and computationally efficient 
model of part localization (Section 3.4). 

3.3.2 Attribute Question Responses 

We use a model of attribute user responses similar to [3]. 
We estimate each p(a*|a*) as a binomial distribution, with 
parameters learned using a training set of user attribute re¬ 
sponses collected from MTurk. As in [3], we allow users to 
qualify their responses with a certainty parameter guessing , 
probably , or definitely , and we incorporate a Beta prior to 
improve robustness when training data is sparse. 

3.4. Inference 

We describe the inference procedure for estimating per- 
class probabilities p(c|f/ t , x) (Eq 2), which involves evalu¬ 
ating f Q p( a c , U l , ©| x)dQ. While this initially seems very 
difficult, we note that all user responses a l p and 6 p are ob¬ 
served values pertaining only to a single part, and attributes 
a c are deterministic when conditioned on a particular choice 
of class c. If we run inference separately for each class c, 
all components of Eqs 5 and 7 can simply be mapped into 
the unary potential for a particular part. Evaluating Eq 2 
exactly is computationally similar to evaluating a separate 
pictorial structure inference problem for each class. 

On the other hand, when C is large, running C inference 
problems can be inefficient. In practice, we use a faster 
procedure which approximates the integral in Eq 3 as a sum 
over K strategically chosen sample points: 

f p(a c , U l , Q\x)dQ 

K 

~ ^2p(U t \a c ,Q t k ,x)p(a c \Q t k ,x)p(Q t k \x) (9) 
k=1 

K 

= p{U t a \a c )y2p( aC \®ki x )p(Ue\@l> x )p(®k\ x )- 

k=1 

We select the sample set 0^...0^ as the set of all local max¬ 
ima in the probability distribution p(t/@ |@)p(0|x). The set 
of local maxima can be found using standard methods for 


maximum likelihood inference on pictorial structures and 
then running non-maximal suppression, where probabilities 
for each user click response p(0 p \0 p ) are first mapped into 
a unary potential fi>(x; 6 p , 6 p ) (see Eq 6) 

ip(x\9 p ,e p ) ='ip(x;d p ) + logp(d p \O p ). (10) 

The inference step takes time linear in the number of parts 
and pixel locations 2 and is efficient enough to run in a frac¬ 
tion of a second with 13 parts, 11 aspects, and 4 scales. 
Inference is re-run each time we obtain a new user click re¬ 
sponse 0 p , resulting in a new set of samples. Sampling as¬ 
signments to part locations ensures that attribute detectors 
only have to be evaluated on K candidate assignments to 
part locations; this opens the door for more expensive cate¬ 
gorization algorithms (such as kernelized methods) that do 
not have to be run in a sliding window fashion. 

4. Selecting the Next Question 

In this section, we introduce a common framework for 
predicting the informativeness of different heterogeneous 
types of user input (including binary questions and mouse 
click responses) that takes into account the expected level 
of human error, informativeness in a multitask setting, ex¬ 
pected annotation time, and spatial relationships between 
different parts. Our method extends the expected informa¬ 
tion gain criterion described in [. ]. 

Let IG t (qj) be the expected information gain 
IG(c; Uj\x, U 1 ) from asking a new question qy 

IG t (<&) = Y, (11) 

iijEAj 

H ( C/t ) = - 5>( c l*’ U f ) \ogp(c\x, U l ) (12) 

C 

where H (U f ) is shorthand for the conditional class entropy 
H(c| x, U 1 ). Evaluating Eq 11 involves considering every 
possible user-supplied answer Uj G Aj to that question, and 
recomputing class probabilities p(c\x, U l , Uj). For yes/no 
attribute questions (querying a variable a©, this is compu¬ 
tationally efficient because the number of possible answers 
is only two, and attribute response probabilities p(U^ |a c ) 
are assumed to be independent from ground truth part loca¬ 
tions (see Eq 9). 

4.1. Predicting Informativeness of Mouse Clicks 

In contrast, for part click questions the number of possi¬ 
ble answers to each question is equal to the number of pixel 
locations, and computing class probabilities requires solv¬ 
ing a new inference problem (Section 3.4) for each such lo¬ 
cation, which quickly becomes computationally intractible. 

2 Maximum likelihood inference involves a bottom-up traversal of T, 
doing a distance transform operation [8] for each part in the tree (takes 
time 0(n) time in the number of pixels). 





We use a similar approximation to the random sampling 
method described in Section 3.4. For a given part location 
question qj, we wish to compute expected entropy: 


e § p [ K{u\~e p )} = \x,u t )^{u\~e p ). (13) 

0 P 

This can be done by drawing K samples 0 pl .--0 pK from the 
distribution p(0 p \x, U t ), then computing expected entropy 

E §p [R(U\O p )]^ (14) 

K 

- ut )J2 p ( c \ x ’ u *> Kk) logp(c|x, U\ 0* k ). 

k=l C 

In this case, each sample 6 pk is extracted from a sample Q k 
(Section 3.4) and each p(c\x, 0 pk ) is approximated as a 
weighted average over samples @|...@^. The full question 
selection procedure is fast enough to run in a fraction of a 
second on a single CPU core when using 13 click questions 
and 312 binary questions. 

4.2. Selecting Questions By Time 


The expected information gain criterion (Eq 11) attempts 
to minimize the total number of questions asked. This 
is suboptimal as different types of questions tend to take 
more time to answer than others ( e.g ., part click questions 
are usually faster than attribute questions). We include a 
simple adaptation that attempts to minimize the expected 
amount of human time spent. The information gain crite¬ 
rion IG t (qj) encodes the expected number of bits of infor¬ 
mation gained by observing the random variable Uj. We 
assume that there is some unknown linear relationship be¬ 
tween bits of information and reduction in human time. The 
best question to ask is then the one with the largest ratio of 
information gain relative to the expected time to answer it: 


Qj(t+ 1) = argmax 


IGt(gj) 

E[time(i/j)] 


(15) 


where Eftime(iXj)] is the expected amount of time required 
to answer a question qj . 


5. Implementation Details 

In this section we describe the dataset used to perform 
experiments, provide implementation details on the types 
of features used, and describe the methodologies used to 
obtain pose information for training. 

5.1. Extended CUB-200 Dataset 

We extended the existing CUB-200 dataset [21 ] to form 
CUB-200-2011 [18], which includes roughly 11,800 im¬ 
ages, nearly double the previous total. Each image is anno¬ 
tated with 312 binary attribute labels and 15 part labels. We 


obtained a list of attributes from a bird field guide website 
[19] and selected the parts associated with those attributes 
for labeling. Five different MTurk workers provided part 
labels for each image by clicking on the image to designate 
the location or denoting part absence (Figure 2(b)). One 
MTurk worker answered attribute questions for each im¬ 
age, specifying response certainty with options guessing , 
probably , and definitely. They were also given the option 
not visible if the associated part with the attribute was not 
present. At test time, we simulated user responses in a sim¬ 
ilar manner to [ 3 ], randomly selecting a stored response for 
each posed question. Instead of using bounding box an¬ 
notations to crop objects, we used full uncropped images, 
resulting in a significantly more challenging dataset than 
CUB-200 [22]. 

5.2. Attribute Features 

For attribute detectors, we used simple linear classifiers 
based on histograms of vector-quantized SIFT and vector- 
quantized RGB features (each with 128 codewords) which 
were extracted from windows around the location of an 
associated part. We believe that significant improvements 
in classification performance could be gained by exploring 
more sophisticated features or learning algorithms. 

5.3. Part Model 

As in [8], the unary scores of our part detector are im¬ 
plemented using HOG templates parametrized by a vector 
of linear appearance weights w Vp for each part and aspect. 
The pairwise scores are quadratic functions over the dis¬ 
placement between (x p ,y p ) and ( x q ,y q ), parametrized by 
a vector of spatial weights w Vp ^ Vq for each pose and pair 
of adjacent parts. For computational efficiency, we assume 
that the pose and scale parameters are defined on an object 
level, and thus inference simply involves running a sepa¬ 
rate sliding window detector for each scale and pose. The 
ground truth scale of each object is computed based on the 
size of the object’s bounding box. 

5.4. Pose Clusters 

Because our object parts are labeled only with visibility, 
we clustered images using k -means on the spatial x- and y- 
offsets of the part locations from their parent part locations, 
normalized with respect to image dimensions; this approach 
handles relative part locations in a manner most similar to 
how we model part relationships (Section 3.2.2). Exam¬ 
ples of images grouped by their pose cluster are shown in 
Figure 2(a). Semantic labels were assigned post hoc by vi¬ 
sual inspection. The clustering, while noisy, reveals some 
underlying pose information that can be discovered by part 
presence and locations. 





(a) Information gain criterion (b) Time criterion 

Figure 4. Classification accuracy as a function of time when 
4(a) maximizing expected information gain; and 4(b) minimizing 
amount of human labor, measured in time. Performance is mea¬ 
sured as the average number of seconds to correctly classify an 
image (described in Section 6.1). 

6. Experiments 

6.1. Performance Metrics 

We evaluate our approach using time as a measure of hu¬ 
man effort needed to classify an object. This metric can 
be considered as a common quantifier for different forms 
of user input. Performance is determined by computing the 
average amount of time taken to correctly classify a test im¬ 
age. The computer presents images of the most likely class 
to the user, who will stop the system when the correct class 
is shown. This assumes that the user can validate the cor¬ 
rect class with 100% accuracy, which may not be always 
possible. Future work will entail studying how well users 
can verify classification results. 

6.2. Results 

Using our criteria for question selection (Section 6.1) 
and our time-to-classification metric, we examine the av¬ 
erage classification accuracy for: (1) our integrated ap¬ 
proach combining localization/classification algorithms and 
part click and binary attribute questions; (2) using binary 
questions only with non-localized computer vision algo¬ 
rithms and expected information gain to select questions 
(representative of [3]); (3) using no computer vision; and (4) 
selecting questions at random. We follow with observations 
on how the addition of click questions affects performance 
and human effort required. 

Question selection by time reduces human effort. By 

minimizing human effort with the time criterion, we are 
trading off between the expected information gain from a 
question response and the expected time to answer that 
question. Subsequently, we are able to classify images in 
36.6 seconds less on average using both binary and click 
questions than if we only take into account expected infor¬ 
mation gain; however, the margin in performance gain be¬ 
tween using and not using click questions is reduced. 

We note that the average time to answer a part click ques¬ 


tion is 3.01 ±0.26 seconds, compared to 7.64±5.38 seconds 
for an attribute question; in this respect, part questions are 
more likely to be asked first. 

Part localization improves performance. In Figure 
4(a), we observe that by selecting the next question using 
our expected information gain criterion, average classifica¬ 
tion time using both types of user input versus only binary 
questions is reduced by 33.8 seconds on average. Compared 
to using no computer vision, we note an average reduction 
in human effort of over 40% (68.2 seconds). 

Using the time criterion for selecting questions, the aver¬ 
age classification time for a single image using both binary 
and click questions is 58.4 seconds. Asking binary ques¬ 
tions only, the system takes an additional 20.4 seconds on 
average to correctly classify an image (Figure 4(b)). Us¬ 
ing computer vision algorithms, we are able to consistently 
achieve higher average classification accuracy than using no 
computer vision at all, in the same period of time. 

User responses drive up performance. There is a 
disparity in classification accuracy between evaluating at¬ 
tribute classifiers on ground truth locations (17.3%) versus 
predicted locations (10.3%); by using user responses to part 
click questions, we are able to overcome initial erroneous 
part detections and guide the system to the correct class. 
Figure 5(a) presents an example in which the bird’s pose 
is estimated incorrectly. After posing one question and re¬ 
evaluating attribute detectors for updated part probability 
distributions, our model is able to correctly predict the class. 

In Figure 5(b), we visualize the question-asking se¬ 
quence and how the probability distribution of part loca¬ 
tions over the image changes with user clicks. We note in 
Figure 5(c) that our pose clusters did not discover certain 
poses, especially frontal views, and the system is unable to 
estimate the pose with high certainty. 

As previously discussed, part click questions take on av¬ 
erage less time to answer. We observe that the system will 
tend to ask 2 or 3 part click questions near the beginning 
and then continue with primarily binary questions ( e.g . Fig¬ 
ure 5(d)). At this point, the remaining parts can often be 
inferred reliably through reasoning over the spatial model, 
and thus binary questions become more advantageous. 

7. Conclusion 

We have proposed a novel approach to object recognition 
of fine-grained categories that efficiently combines class at¬ 
tribute and part models and selects questions to pose to the 
user in an intelligent manner. Our experiments, carried out 
on a challenging dataset including 200 bird species, show 
that our system is accurate and quick. In addition to demon¬ 
strating our approach on a diverse set of basic-level cate¬ 
gories, future work will include introducing more advanced 
image features in order to improve attribute classification 
performance. Furthermore, we used simple mouse clicks to 












Initial pred. part locations 



Ql\ Is the throat yellow? 
no (4.547 s) 


Great Crested Flycatcher? no 



Figure 5. Four examples of the behavior of our system. 5(a): The system estimates the bird pose incorrectly but is able to localize the 
head and upper body region well, and the initial class prediction captures the color of the localized parts. The user’s response to the first 
system-selected part click question helps correct computer vision. 5(b): The bird is incorrectly detected, as shown in the probability maps 
displaying the likelihood of individual part locations for a subset of the possible poses (not visible to the user). The system selects “Click 
on the beak” as the first question to the user. After the user’s click, the other part location probabilities are updated and exhibit a shift 
towards improved localization and pose estimation. 5(c): Certain infrequent poses ( e.g . frontal views) were not discovered by the initial 
off-line clustering (see Figure 2(a)). The initial probability distributions of part locations over the image demonstrate the uncertainty in 
fitting the pose models. The system tends to fail on these unfamiliar poses. 5(d): The system will at times select both part click and binary 
questions to correctly classify images. 


designate part locations, and it would be of interest to inves¬ 
tigate whether asking the user to provide more detailed part 
and pose annotations would further speed up recognition. 
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